Abstract

We generalize Newton-type methods for minimizing smooth functions
to handle a sum of two convex functions: a smooth function and a
nonsmooth function with a simple proximal mapping. We show that the
resulting proximal Newton-type methods inherit the desirable convergence
behavior of Newton-type methods for minimizing smooth functions, even
when search directions are computed inexactly. Many popular methods
tailored to problems arising in bioinformatics, signal processing, and
statistical learning are special cases of proximal Newton-type methods, and
our analysis yields new convergence results for some of these methods.
1 Introduction

Many problems of relevance in bioinformatics, signal processing, and statistical
learning can be formulated as minimizing a composite function:

$$\text{minimize} \quad f(x) := g(x) + h(x), \qquad (1.1)$$

where $g$ is a convex, continuously differentiable loss function, and $h$ is a convex
but not necessarily differentiable penalty function or regularizer. Such problems
include the lasso [23], the graphical lasso [10], and trace-norm matrix completion.

We describe a family of Newton-type methods for minimizing composite
functions that achieve superlinear rates of convergence subject to standard
assumptions. The methods can be interpreted as generalizations of the classic
proximal gradient method that account for the curvature of the function when
selecting a search direction. Many popular methods for minimizing composite
functions are special cases of these proximal Newton-type methods, and our
analysis yields new convergence results for some of these methods.

1.1 Notation 

The methods we consider are line search methods, which means that they
produce a sequence of points $\{x_k\}$ according to

$$x_{k+1} = x_k + t_k \Delta x_k,$$

where $t_k$ is a step length and $\Delta x_k$ is a search direction. When we focus on one
iteration of an algorithm, we drop the subscripts (e.g. $x_+ = x + t \Delta x$). All the
methods we consider compute search directions by minimizing local quadratic
models of the composite function $f$. We use a hat accent to denote these local
quadratic models (e.g. $\hat f_k$ is a local quadratic model of $f$ at the $k$-th step).

1.2 First-order methods 

The most popular methods for minimizing composite functions are first-order
methods that use proximal mappings to handle the nonsmooth part h. SpaRSA 
[21)] is a popular spectral projected gradient method that uses a spectral step 
length together with a nonmonotone line search to improve convergence. TRIP 
[T5] also uses a spectral step length but selects search directions using a trust- 
region strategy. 

We can accelerate the convergence of first-order methods using ideas due 
to Nesterov [15]. This yields accelerated first-order methods, which achieve
$\epsilon$-suboptimality within $O(1/\sqrt{\epsilon})$ iterations [24]. The most popular method in
this family is the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) pQ. 
These methods have been implemented in the package TFOCS [3] and used to 
solve problems that commonly arise in statistics, signal processing, and statis- 
tical learning. 

1.3 Newton-type methods 

There are two classes of methods that generalize Newton-type methods for
minimizing smooth functions to handle composite functions. Nonsmooth
Newton-type methods [27] successively minimize a local quadratic model of the
composite function $f$:

$$\hat f_k(y) = f(x_k) + \sup_{z \in \partial f(x_k)} z^T (y - x_k) + \frac{1}{2} (y - x_k)^T H_k (y - x_k),$$

where $H_k$ accounts for the curvature of $f$. (Although computing this $\Delta x_k$ is
generally not practical, we can exploit the special structure of $f$ in many
statistical learning problems.) Our proximal Newton-type methods approximate
only the smooth part $g$ with a local quadratic model:

$$\hat f_k(y) = g(x_k) + \nabla g(x_k)^T (y - x_k) + \frac{1}{2} (y - x_k)^T H_k (y - x_k) + h(y),$$

where $H_k$ is an approximation to $\nabla^2 g(x_k)$. This idea can be traced back to
the generalized proximal point method of Fukushima and Mine. Many popular
methods for minimizing composite functions are special cases of proximal
Newton-type methods. Methods tailored to a specific problem include glmnet
[9], LIBLINEAR [28], QUIC [12], and the Newton-LASSO method [16]. Generic
methods include projected Newton-type methods [22] [21], proximal quasi-Newton
methods [20] [2], and the method of Tseng and Yun [25] [14].

There is a rich literature on solving generalized equations, monotone
inclusions, and variational inequalities. Minimizing composite functions is a special
case of solving these problems, and proximal Newton-type methods are special
cases of Newton-type methods for these problems [17]. We refer to [18] for a
unified treatment of descent methods (including proximal Newton-type methods)
for such problems.

2 Proximal Newton-type methods 

We seek to minimize composite functions $f(x) := g(x) + h(x)$ as in (1.1). We
assume $g$ and $h$ are closed, convex functions, $g$ is continuously differentiable,
and its gradient $\nabla g$ is Lipschitz continuous. $h$ is not necessarily everywhere
differentiable, but its proximal mapping (2.1) can be evaluated efficiently. We
refer to $g$ as "the smooth part" and $h$ as "the nonsmooth part". We assume
the optimal value $f^*$ is attained at some optimal solution $x^*$, not necessarily
unique.

2.1 The proximal gradient method 

The proximal mapping of a convex function $h$ at $x$ is

$$\operatorname{prox}_h(x) := \arg\min_y \; h(y) + \frac{1}{2} \|y - x\|^2. \qquad (2.1)$$

Proximal mappings can be interpreted as generalized projections because if $h$
is the indicator function of a convex set, then $\operatorname{prox}_h(x)$ is the projection of $x$
onto the set. If $h$ is the $\ell_1$ norm and $t$ is a step length, then $\operatorname{prox}_{t\ell_1}(x)$ is the
soft-threshold operation:

$$\operatorname{prox}_{t\ell_1}(x) = \operatorname{sign}(x) \cdot \max\{|x| - t, 0\},$$

where sign and max are entry-wise, and $\cdot$ denotes the entry-wise product.
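
To make the two examples of proximal mappings concrete, here is a minimal Python sketch. The function names `soft_threshold` and `project_box` are illustrative choices of ours, and the box $[lo, hi]^n$ is just one assumed example of a convex set.

```python
import math

def soft_threshold(x, t):
    """Entry-wise soft-thresholding: the proximal mapping of t * ||.||_1 at x."""
    out = []
    for xi in x:
        mag = abs(xi) - t
        # Shrink the magnitude by t and keep the sign; entries within t of zero vanish.
        out.append(math.copysign(mag, xi) if mag > 0 else 0.0)
    return out

def project_box(x, lo, hi):
    """Proximal mapping of the indicator of the box [lo, hi]^n: Euclidean projection."""
    return [min(max(xi, lo), hi) for xi in x]

x = [3.0, -0.5, 1.25, -2.0]
print(soft_threshold(x, 1.0))      # [2.0, 0.0, 0.25, -1.0]
print(project_box(x, -1.0, 1.0))   # [1.0, -0.5, 1.0, -1.0]
```

Both mappings cost one pass over the entries of $x$, which is the sense in which such a nonsmooth part has a "simple" proximal mapping.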
The proximal gradient method uses the proximal mapping of the nonsmooth
part to minimize composite functions:

$$x_{k+1} = x_k - t_k G_{t_k f}(x_k),$$
$$G_{t_k f}(x_k) := \frac{1}{t_k} \left( x_k - \operatorname{prox}_{t_k h}(x_k - t_k \nabla g(x_k)) \right),$$

where $t_k$ denotes the $k$-th step length and $G_{t_k f}(x_k)$ is a composite gradient
step. Most first-order methods, including SpaRSA and accelerated first-order
methods, are variants of this simple method. We note three properties of the
composite gradient step:

1. $G_{t_k f}(x_k)$ steps to the minimizer of $h$ plus a simple quadratic model of $g$
near $x_k$:

$$x_{k+1} = \operatorname{prox}_{t_k h}(x_k - t_k \nabla g(x_k)) \qquad (2.2)$$
$$= \arg\min_y \; t_k h(y) + \frac{1}{2} \|y - x_k + t_k \nabla g(x_k)\|^2 \qquad (2.3)$$
$$= \arg\min_y \; \nabla g(x_k)^T (y - x_k) + \frac{1}{2 t_k} \|y - x_k\|^2 + h(y). \qquad (2.4)$$

2. $G_{t_k f}(x_k)$ is neither a gradient nor a subgradient of $f$ at any point; rather
it is the sum of an explicit gradient and an implicit subgradient:

$$G_{t_k f}(x_k) \in \nabla g(x_k) + \partial h(x_{k+1}).$$

3. $G_{t_k f}(x)$ is zero if and only if $x$ minimizes $f$.

The third property generalizes the zero gradient optimality condition for smooth
functions to composite functions. We shall use the length of $G_f(x)$ to measure
the optimality of a point $x$.
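
The following sketch runs the proximal gradient method on a small lasso problem and reports the length of the composite gradient step at the final iterate. The data `A`, `b`, `lam` and the fixed step length `t` are hypothetical values chosen for illustration; a fixed step below $1/L_1$ is one simple choice, not the only one.

```python
import math

# Hypothetical toy lasso instance: g(x) = 0.5*||A x - b||^2, h(x) = lam*||x||_1.
A = [[2.0, 1.0], [0.0, 1.0]]
b = [4.0, 1.0]
lam = 1.0

def soft_threshold(v, t):
    return [math.copysign(abs(vi) - t, vi) if abs(vi) > t else 0.0 for vi in v]

def grad_g(x):
    r = [A[i][0] * x[0] + A[i][1] * x[1] - b[i] for i in range(2)]  # residual A x - b
    return [A[0][j] * r[0] + A[1][j] * r[1] for j in range(2)]      # A^T r

def composite_gradient(x, t):
    """G_{tf}(x) = (x - prox_{th}(x - t * grad g(x))) / t."""
    gx = grad_g(x)
    p = soft_threshold([x[j] - t * gx[j] for j in range(2)], t * lam)
    return [(x[j] - p[j]) / t for j in range(2)]

def proximal_gradient(x, t, iters):
    for _ in range(iters):
        G = composite_gradient(x, t)
        x = [x[j] - t * G[j] for j in range(2)]
    return x

t = 0.15  # below 1/L1, where L1 is the largest eigenvalue of A^T A (about 5.24)
x_hat = proximal_gradient([0.0, 0.0], t, 500)
G_hat = composite_gradient(x_hat, t)
print(x_hat, math.hypot(G_hat[0], G_hat[1]))
```

For this instance the minimizer can be checked by hand to be $(1.5, 0.5)$, and the composite gradient step vanishes there, in line with the third property.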

Lemma 2.1. If $\nabla g$ is Lipschitz continuous with constant $L_1$, then $\|G_f(x)\|$
satisfies:

$$\|G_f(x)\| \le (L_1 + 1) \|x - x^*\|.$$

Proof. The composite gradient steps at $x$ and the optimal solution $x^*$ satisfy

$$G_f(x) \in \nabla g(x) + \partial h(x - G_f(x)),$$
$$G_f(x^*) \in \nabla g(x^*) + \partial h(x^*).$$

We subtract these two expressions, use $G_f(x^*) = 0$, and rearrange to obtain

$$\partial h(x - G_f(x)) - \partial h(x^*) \ni G_f(x) - (\nabla g(x) - \nabla g(x^*)).$$

$\partial h$ is monotone, hence

$$\begin{aligned}
0 &\le (x - G_f(x) - x^*)^T \left( \partial h(x - G_f(x)) - \partial h(x^*) \right) \\
&= -G_f(x)^T G_f(x) + (x - x^*)^T G_f(x) + G_f(x)^T (\nabla g(x) - \nabla g(x^*)) \\
&\quad - (x - x^*)^T (\nabla g(x) - \nabla g(x^*)).
\end{aligned}$$

We drop the last term because $(x - x^*)^T (\nabla g(x) - \nabla g(x^*))$ is nonnegative
($\nabla g$ is monotone) to obtain

$$\begin{aligned}
0 &\le -\|G_f(x)\|^2 + (x - x^*)^T G_f(x) + G_f(x)^T (\nabla g(x) - \nabla g(x^*)) \\
&\le -\|G_f(x)\|^2 + \|G_f(x)\| \left( \|x - x^*\| + \|\nabla g(x) - \nabla g(x^*)\| \right).
\end{aligned}$$

We rearrange to obtain

$$\|G_f(x)\| \le \|x - x^*\| + \|\nabla g(x) - \nabla g(x^*)\|.$$

$\nabla g$ is Lipschitz continuous, hence

$$\|G_f(x)\| \le (L_1 + 1) \|x - x^*\|.$$

□

2.2 Proximal Newton-type methods 

Proximal Newton-type methods use a local quadratic model (in lieu of the simple
quadratic model in the proximal gradient method (2.4)) to account for the
curvature of $g$. A local quadratic model of $g$ at $x_k$ is

$$\hat g_k(y) = \nabla g(x_k)^T (y - x_k) + \frac{1}{2} (y - x_k)^T H_k (y - x_k),$$

where $H_k$ denotes an approximation to $\nabla^2 g(x_k)$. A proximal Newton-type
search direction $\Delta x_k$ solves the subproblem

$$\Delta x_k = \arg\min_d \; \hat f_k(x_k + d) := \hat g_k(x_k + d) + h(x_k + d). \qquad (2.5)$$

There are many strategies for choosing $H_k$. If we choose $H_k$ to be $\nabla^2 g(x_k)$,
then we obtain the proximal Newton method. If we build an approximation to
$\nabla^2 g(x_k)$ using changes measured in $\nabla g$ according to a quasi-Newton strategy,
we obtain a proximal quasi-Newton method. If the problem is large, we can
use limited memory quasi-Newton updates to reduce memory usage. Generally
speaking, most strategies for choosing Hessian approximations for Newton-type
methods (for minimizing smooth functions) can be adapted to choosing $H_k$ in
the context of proximal Newton-type methods.
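
As one illustration of a quasi-Newton strategy, the sketch below implements the standard BFGS update of a Hessian approximation from the change in the iterate and the change in $\nabla g$. BFGS is our assumed example; the text does not single out a particular update, and the function name `bfgs_update` is ours.

```python
def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B (list of rows).

    s = x_{k+1} - x_k is the change in the iterate and
    y = grad g(x_{k+1}) - grad g(x_k) is the change measured in the gradient.
    """
    n = len(s)
    Bs = [sum(B[i][j] * s[j] for j in range(n)) for i in range(n)]
    sBs = sum(s[i] * Bs[i] for i in range(n))
    ys = sum(y[i] * s[i] for i in range(n))
    if ys <= 0.0:
        # Curvature condition fails: skip the update to keep B positive definite.
        return [row[:] for row in B]
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys for j in range(n)]
            for i in range(n)]

B0 = [[1.0, 0.0], [0.0, 1.0]]
B1 = bfgs_update(B0, [1.0, 0.0], [2.0, 1.0])
print(B1)  # [[2.0, 1.0], [1.0, 1.5]]
```

The updated matrix satisfies the secant equation $B_1 s = y$ and remains positive definite whenever $y^T s > 0$, which is what the subproblem (2.5) needs from $H_k$.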

We can also express a proximal Newton-type search direction using scaled
proximal mappings. This lets us interpret a proximal Newton-type search
direction as a "composite Newton step" and reveals a connection with the composite
gradient step.

Definition 2.2. Let $h$ be a convex function and $H$ a positive definite matrix.
Then the scaled proximal mapping of $h$ at $x$ is

$$\operatorname{prox}_h^H(x) := \arg\min_y \; h(y) + \frac{1}{2} \|y - x\|_H^2. \qquad (2.6)$$

Scaled proximal mappings share many properties with (unscaled) proximal
mappings:

1. $\operatorname{prox}_h^H(x)$ exists and is unique for $x \in \operatorname{dom} h$ because the proximity
function is strongly convex if $H$ is positive definite.

2. Let $\partial h(x)$ be the subdifferential of $h$ at $x$. Then $\operatorname{prox}_h^H(x)$ satisfies

$$H \left( x - \operatorname{prox}_h^H(x) \right) \in \partial h \left( \operatorname{prox}_h^H(x) \right). \qquad (2.7)$$

3. $\operatorname{prox}_h^H(x)$ is firmly nonexpansive in the $H$-norm. That is, if $u = \operatorname{prox}_h^H(x)$
and $v = \operatorname{prox}_h^H(y)$, then

$$(u - v)^T H (x - y) \ge \|u - v\|_H^2,$$

and the Cauchy-Schwarz inequality implies $\|u - v\|_H \le \|x - y\|_H$.

We can express a proximal Newton-type search direction as a "composite
Newton step" using scaled proximal mappings:

$$\Delta x = \operatorname{prox}_h^H \left( x - H^{-1} \nabla g(x) \right) - x. \qquad (2.8)$$

We use (2.7) to deduce that a proximal Newton search direction satisfies

$$H \left( -H^{-1} \nabla g(x) - \Delta x \right) \in \partial h(x + \Delta x).$$

We simplify to obtain

$$H \Delta x \in -\nabla g(x) - \partial h(x + \Delta x). \qquad (2.9)$$

Thus a proximal Newton-type search direction, like the composite gradient step,
combines an explicit gradient with an implicit subgradient. Note this expression
yields the Newton system in the case of smooth functions (i.e., $h$ is zero).
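
When $H$ is diagonal and $h$ is a multiple of the $\ell_1$ norm, the scaled proximal mapping separates across coordinates and reduces to soft-thresholding with per-coordinate thresholds. The sketch below computes the composite Newton step (2.8) in that special case and checks the inclusion (2.9) numerically. The diagonal $H$, the weight `lam`, and all numerical values are assumptions made for illustration.

```python
import math

def scaled_prox_l1_diag(z, d, lam):
    """prox of lam*||.||_1 in the norm induced by H = diag(d), with d > 0.

    Each coordinate solves min_y lam*|y| + 0.5*d_i*(y - z_i)^2,
    i.e. soft-thresholding of z_i at level lam / d_i.
    """
    return [math.copysign(abs(zi) - lam / di, zi) if abs(zi) > lam / di else 0.0
            for zi, di in zip(z, d)]

def composite_newton_step(x, grad, d, lam):
    """Search direction (2.8): prox_h^H(x - H^{-1} grad) - x for diagonal H."""
    z = [xi - gi / di for xi, gi, di in zip(x, grad, d)]
    y = scaled_prox_l1_diag(z, d, lam)
    return [yi - xi for yi, xi in zip(y, x)]

def satisfies_inclusion(x, grad, d, lam, dx, tol=1e-12):
    """Check (2.9): -grad - H*dx lies in lam * subdifferential of ||x + dx||_1."""
    for xi, gi, di, dxi in zip(x, grad, d, dx):
        s = -gi - di * dxi
        yi = xi + dxi
        if yi != 0.0:
            if abs(s - math.copysign(lam, yi)) > tol:
                return False
        elif abs(s) > lam + tol:
            return False
    return True

x, grad, d, lam = [1.0, -2.0, 0.1], [4.0, -1.0, 0.3], [2.0, 4.0, 1.0], 0.5
dx = composite_newton_step(x, grad, d, lam)
print(dx, satisfies_inclusion(x, grad, d, lam, dx))
```

For a general (non-diagonal) $H$ there is no such closed form, which is why the subproblem is usually solved iteratively.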

Proposition 2.3 (Search direction properties). If $H$ is positive definite, then $\Delta x$
in (2.5) satisfies

$$f(x_+) \le f(x) + t \left( \nabla g(x)^T \Delta x + h(x + \Delta x) - h(x) \right) + O(t^2), \qquad (2.10)$$
$$\nabla g(x)^T \Delta x + h(x + \Delta x) - h(x) \le -\Delta x^T H \Delta x. \qquad (2.11)$$

Proof. For $t \in (0, 1]$,

$$\begin{aligned}
f(x_+) - f(x) &= g(x_+) - g(x) + h(x_+) - h(x) \\
&\le g(x_+) - g(x) + t h(x + \Delta x) + (1 - t) h(x) - h(x) \\
&= g(x_+) - g(x) + t (h(x + \Delta x) - h(x)) \\
&= \nabla g(x)^T (t \Delta x) + t (h(x + \Delta x) - h(x)) + O(t^2),
\end{aligned}$$

which proves (2.10).
Since $\Delta x$ steps to the minimizer of $\hat f$ (2.5), $t \Delta x$ satisfies

$$\begin{aligned}
\nabla g(x)^T \Delta x + \frac{1}{2} \Delta x^T H \Delta x + h(x + \Delta x)
&\le \nabla g(x)^T (t \Delta x) + \frac{1}{2} t^2 \Delta x^T H \Delta x + h(x_+) \\
&\le t \nabla g(x)^T \Delta x + \frac{1}{2} t^2 \Delta x^T H \Delta x + t h(x + \Delta x) + (1 - t) h(x).
\end{aligned}$$

We rearrange and then simplify:

$$(1 - t) \nabla g(x)^T \Delta x + \frac{1}{2} (1 - t^2) \Delta x^T H \Delta x + (1 - t)(h(x + \Delta x) - h(x)) \le 0,$$
$$\nabla g(x)^T \Delta x + \frac{1}{2} (1 + t) \Delta x^T H \Delta x + h(x + \Delta x) - h(x) \le 0,$$
$$\nabla g(x)^T \Delta x + h(x + \Delta x) - h(x) \le -\frac{1}{2} (1 + t) \Delta x^T H \Delta x.$$

Finally, we let $t \to 1$ and rearrange to obtain (2.11). □



Proposition 2.3 implies the search direction is a descent direction for $f$
because we can substitute (2.11) into (2.10) to obtain

$$f(x_+) \le f(x) - t \Delta x^T H \Delta x + O(t^2). \qquad (2.12)$$

Proposition 2.4. Suppose $H$ is positive definite. Then $x^*$ is an optimal solution
if and only if at $x^*$ the search direction $\Delta x$ (2.5) is zero.

Proof. If $\Delta x$ at $x^*$ is nonzero, it is a descent direction for $f$ at $x^*$. Hence, $x^*$
cannot be a minimizer of $f$. If $\Delta x = 0$, then $x$ is the minimizer of $\hat f$. Thus

$$\nabla g(x)^T (t d) + \frac{1}{2} t^2 d^T H d + h(x + t d) - h(x) \ge 0$$

for all $t > 0$ and $d$. We rearrange to obtain

$$h(x + t d) - h(x) \ge -t \nabla g(x)^T d - \frac{1}{2} t^2 d^T H d. \qquad (2.13)$$

Let $Df(x, d)$ be the directional derivative of $f$ at $x$ in the direction $d$:

$$\begin{aligned}
Df(x, d) &= \lim_{t \to 0} \frac{f(x + t d) - f(x)}{t} \\
&= \lim_{t \to 0} \frac{g(x + t d) - g(x) + h(x + t d) - h(x)}{t} \\
&= \lim_{t \to 0} \frac{t \nabla g(x)^T d + O(t^2) + h(x + t d) - h(x)}{t}. \qquad (2.14)
\end{aligned}$$

We substitute (2.13) into (2.14) to obtain

$$\begin{aligned}
Df(x, d) &\ge \lim_{t \to 0} \frac{t \nabla g(x)^T d + O(t^2) - \frac{1}{2} t^2 d^T H d - t \nabla g(x)^T d}{t} \\
&= \lim_{t \to 0} \frac{-\frac{1}{2} t^2 d^T H d + O(t^2)}{t} = 0.
\end{aligned}$$

Since $f$ is convex, $x$ is an optimal solution if and only if $\Delta x = 0$. □
In a few special cases we can derive a closed form expression for the proximal
Newton search direction, but we must usually resort to an iterative method. The
user should choose an iterative method that exploits the properties of $h$. E.g.,
if $h$ is the $\ell_1$ norm, then (block) coordinate descent methods combined with an
active set strategy are known to be very efficient for these problems [3].
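
As a rough illustration of such an iterative method, the sketch below applies plain cyclic coordinate descent to the subproblem (2.5) when $h = \lambda \|\cdot\|_1$. It omits the active set strategy mentioned above, uses a fixed number of sweeps instead of a stopping test, and the name `cd_subproblem` and the numerical values are our own assumptions.

```python
import math

def soft(v, t):
    return math.copysign(abs(v) - t, v) if abs(v) > t else 0.0

def cd_subproblem(x, grad, H, lam, sweeps=100):
    """Approximately minimize grad^T d + 0.5 d^T H d + lam*||x + d||_1 over d.

    With the other coordinates fixed, the optimal u = x_j + d_j is the
    soft-threshold of x_j - c / H_jj at level lam / H_jj, where c collects
    the gradient entry and the off-diagonal terms of H d.
    """
    n = len(x)
    d = [0.0] * n
    for _ in range(sweeps):
        for j in range(n):
            c = grad[j] + sum(H[j][i] * d[i] for i in range(n) if i != j)
            d[j] = soft(x[j] - c / H[j][j], lam / H[j][j]) - x[j]
    return d

H = [[2.0, 0.5], [0.5, 1.0]]
d = cd_subproblem([0.5, 0.0], [1.0, -0.2], H, 0.3)
print(d)  # approximately [-0.5, 0.15]
```

Each coordinate update is exact and cheap, which is what makes coordinate descent attractive when the nonsmooth part is separable.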

We use a line search procedure to select a step length $t$ that satisfies a
sufficient descent condition:

$$f(x_+) \le f(x) + \alpha t \Delta, \qquad (2.15)$$
$$\Delta := \nabla g(x)^T \Delta x + h(x + \Delta x) - h(x), \qquad (2.16)$$

where $\alpha \in (0, 0.5)$ can be interpreted as the fraction of the decrease in $f$
predicted by linear extrapolation that we will accept. A simple example of a line
search procedure is called backtracking line search [4].
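
A backtracking line search for the sufficient descent condition can be sketched as follows. The shrink factor `beta`, the cap on the number of halvings, and the one-dimensional test function are assumptions made for illustration.

```python
def backtracking(f, x, dx, delta, alpha=0.25, beta=0.5, max_steps=50):
    """Shrink t from 1 until f(x + t*dx) <= f(x) + alpha * t * delta.

    delta is the predicted decrease grad g(x)^T dx + h(x + dx) - h(x),
    which is negative for a descent direction.
    """
    t = 1.0
    fx = f(x)
    for _ in range(max_steps):
        x_new = [xi + t * di for xi, di in zip(x, dx)]
        if f(x_new) <= fx + alpha * t * delta:
            return t
        t *= beta
    return t

# Toy composite function f(x) = 2*x^2 + |x| (g = 2*x^2, h = |x|) at x = 1.
f = lambda x: 2.0 * x[0] ** 2 + abs(x[0])
# With H = 1 the search direction is prox step soft(1 - 4, 1) - 1 = -3,
# and delta = 4 * (-3) + |1 - 3| - |1| = -11.
t = backtracking(f, [1.0], [-3.0], -11.0)
print(t)  # 0.5
```

Here the unit step overshoots because $H = 1$ underestimates the curvature of $g$, so one halving is needed before the condition holds.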

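Lemma 2.5. Suppose $H \succeq mI$ for some $m > 0$ and $\nabla g$ is Lipschitz continuous
with constant $L_1$. Then there exists $\kappa$ such that

$$t \le \min \left\{ 1, \frac{2}{\kappa} (1 - \alpha) \right\} \qquad (2.17)$$

satisfies the sufficient descent condition (2.16).

Proof. We can bound the decrease at each iteration by

$$\begin{aligned}
f(x_+) - f(x) &= g(x_+) - g(x) + h(x_+) - h(x) \\
&\le \int_0^1 \nabla g(x + s(t \Delta x))^T (t \Delta x) \, ds + t h(x + \Delta x) + (1 - t) h(x) - h(x) \\
&= \nabla g(x)^T (t \Delta x) + t (h(x + \Delta x) - h(x)) \\
&\quad + \int_0^1 \left( \nabla g(x + s(t \Delta x)) - \nabla g(x) \right)^T (t \Delta x) \, ds \\
&\le t \left( \nabla g(x)^T \Delta x + h(x + \Delta x) - h(x)
+ \int_0^1 \|\nabla g(x + s(t \Delta x)) - \nabla g(x)\| \|\Delta x\| \, ds \right).
\end{aligned}$$

Since $\nabla g$ is Lipschitz continuous with constant $L_1$,

$$\begin{aligned}
f(x_+) - f(x) &\le t \left( \nabla g(x)^T \Delta x + h(x + \Delta x) - h(x) + \frac{L_1 t}{2} \|\Delta x\|^2 \right) \\
&= t \left( \Delta + \frac{L_1 t}{2} \|\Delta x\|^2 \right), \qquad (2.18)
\end{aligned}$$

where we use (2.11). If we choose $t \le \frac{2}{\kappa} (1 - \alpha)$, $\kappa = L_1 / m$, then

$$\begin{aligned}
\frac{L_1 t}{2} \|\Delta x\|^2 &\le m (1 - \alpha) \|\Delta x\|^2 \\
&\le (1 - \alpha) \Delta x^T H \Delta x \\
&\le -(1 - \alpha) \Delta, \qquad (2.19)
\end{aligned}$$

where we again use (2.11). We substitute (2.19) into (2.18) to obtain

$$f(x_+) - f(x) \le t \left( \Delta - (1 - \alpha) \Delta \right) = t (\alpha \Delta).$$

□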

Algorithm 1 A generic proximal Newton-type method 

Require: starting point $x_0 \in \operatorname{dom} f$
1: repeat
2:   Choose $H_k$, a positive definite approximation to the Hessian.
3:   Solve the subproblem for a search direction:
       $\Delta x_k \leftarrow \arg\min_d \; \nabla g(x_k)^T d + \frac{1}{2} d^T H_k d + h(x_k + d)$.
4:   Select $t_k$ with a backtracking line search.
5:   Update: $x_{k+1} \leftarrow x_k + t_k \Delta x_k$.
6: until stopping conditions are satisfied.
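
The steps of Algorithm 1 can be put together in a short script. The sketch below is one possible instance under assumptions of our own: a two-variable $\ell_1$-regularized logistic regression with a small ridge term (so that the exact Hessian is positive definite), coordinate descent for step 3, and a stopping test on the predicted decrease. The data and constants are hypothetical.

```python
import math

# Hypothetical toy data: rows a_i of a 4x2 design matrix and labels y_i in {-1, +1}.
A = [[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5], [0.5, 1.0]]
Y = [1.0, -1.0, -1.0, 1.0]
LAM, MU = 0.1, 0.1  # l1 weight; small ridge term so the Hessian stays positive definite

def soft(v, t):
    return math.copysign(abs(v) - t, v) if abs(v) > t else 0.0

def margins(x):
    return [y * (a[0] * x[0] + a[1] * x[1]) for a, y in zip(A, Y)]

def g(x):  # smooth part: logistic loss plus ridge
    return sum(math.log1p(math.exp(-m)) for m in margins(x)) + 0.5 * MU * (x[0] ** 2 + x[1] ** 2)

def h(x):  # nonsmooth part: l1 penalty
    return LAM * (abs(x[0]) + abs(x[1]))

def grad_g(x):
    out = [MU * x[0], MU * x[1]]
    for a, y, m in zip(A, Y, margins(x)):
        w = -y / (1.0 + math.exp(m))
        out[0] += w * a[0]
        out[1] += w * a[1]
    return out

def hess_g(x):
    H = [[MU, 0.0], [0.0, MU]]
    for a, m in zip(A, margins(x)):
        s = 1.0 / (1.0 + math.exp(-m))
        for i in range(2):
            for j in range(2):
                H[i][j] += s * (1.0 - s) * a[i] * a[j]
    return H

def solve_subproblem(x, gr, H, sweeps=100):
    """Step 3: coordinate descent on gr^T d + 0.5 d^T H d + h(x + d)."""
    d = [0.0, 0.0]
    for _ in range(sweeps):
        for j in range(2):
            c = gr[j] + H[j][1 - j] * d[1 - j]
            d[j] = soft(x[j] - c / H[j][j], LAM / H[j][j]) - x[j]
    return d

def prox_newton(x, iters=50, alpha=0.25):
    for _ in range(iters):
        gr, H = grad_g(x), hess_g(x)      # step 2: exact Hessian
        d = solve_subproblem(x, gr, H)    # step 3
        delta = gr[0] * d[0] + gr[1] * d[1] + h([x[0] + d[0], x[1] + d[1]]) - h(x)
        if delta > -1e-16:                # stopping condition (our choice)
            break
        t, fx = 1.0, g(x) + h(x)          # step 4: backtracking line search
        while t > 1e-10:
            y = [x[0] + t * d[0], x[1] + t * d[1]]
            if g(y) + h(y) <= fx + alpha * t * delta:
                break
            t *= 0.5
        x = [x[0] + t * d[0], x[1] + t * d[1]]  # step 5
    return x

def composite_gradient_norm(x):
    gr = grad_g(x)
    return math.hypot(*(x[j] - soft(x[j] - gr[j], LAM) for j in range(2)))

x_hat = prox_newton([0.0, 0.0])
print(x_hat, composite_gradient_norm(x_hat))
```

The length of the composite gradient step at the returned point serves as the optimality measure, as in Section 2.1.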



2.3 Inexact proximal Newton-type methods 

Inexact proximal Newton-type methods solve subproblem (2.5) approximately
to obtain inexact search directions. These methods can be more efficient than
their exact counterparts because they require less computational expense per
iteration. In fact, many practical implementations of proximal Newton-type
methods such as glmnet, LIBLINEAR, and QUIC use inexact search directions.

In practice, how exactly (or inexactly) we solve the subproblem is critical 
to the efficiency and reliability of the method. The practical implementations 
of proximal Newton-type methods we mentioned use a variety of heuristics to 
decide how accurately to solve the subproblem. Although these methods per- 
form admirably in practice, there are few results on how inexact solutions to 
the subproblem affect their convergence behavior. 

First we propose an adaptive stopping condition for the subproblem. Then in
section 3 we analyze the convergence behavior of inexact Newton-type methods.
Finally, in section 4 we conduct computational experiments to compare the
performance of our stopping condition against commonly used heuristics.

Our stopping condition is motivated by the adaptive stopping condition used
by inexact Newton-type methods for minimizing smooth functions:

$$\|\nabla \hat g_k(x_k + \Delta x_k)\| \le \eta_k \|\nabla g(x_k)\|, \qquad (2.20)$$

where $\eta_k$ is called a forcing term because it forces the left-hand side to be small.
We generalize (2.20) to composite functions by substituting composite gradients
into (2.20) and scaling the norm:

$$\|\nabla \hat g_k(x_k + \Delta x_k) + \partial h(x_k + \Delta x_k)\|_{H^{-1}} \le \eta_k \|G_f(x_k)\|_{H^{-1}}. \qquad (2.21)$$
Following Eisenstat and Walker [8], we set $\eta_k$ based on how well $\hat g_{k-1}$
approximates $g$ near $x_k$:

$$\eta_k = \min \left\{ 0.1, \frac{\|\nabla \hat g_{k-1}(x_k) - \nabla g(x_k)\|}{\|\nabla g(x_{k-1})\|} \right\}. \qquad (2.22)$$

This choice yields desirable convergence results and performs admirably in
practice.
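
A forcing term of this kind can be computed from quantities already available at the previous iterate. The sketch below assumes the Eisenstat-Walker style ratio $\|\nabla \hat g_{k-1}(x_k) - \nabla g(x_k)\| / \|\nabla g(x_{k-1})\|$ capped at 0.1, and assumes a unit step so that $x_k = x_{k-1} + \Delta x_{k-1}$; the function name and argument layout are ours.

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def forcing_term(grad_prev, hess_prev, dx_prev, grad_curr, cap=0.1):
    """Forcing term from the mismatch between the previous model and g at x_k.

    grad_prev, hess_prev: gradient and Hessian approximation at x_{k-1};
    dx_prev: the step taken (assumed x_k = x_{k-1} + dx_prev);
    grad_curr: gradient of g at x_k.
    """
    n = len(grad_prev)
    # Gradient of the previous quadratic model evaluated at x_k.
    model_grad = [grad_prev[i] + sum(hess_prev[i][j] * dx_prev[j] for j in range(n))
                  for i in range(n)]
    mismatch = norm([model_grad[i] - grad_curr[i] for i in range(n)])
    return min(cap, mismatch / norm(grad_prev))

print(forcing_term([2.0], [[1.0]], [-1.0], [1.0]))  # model exact: 0.0
print(forcing_term([2.0], [[1.0]], [-1.0], [1.5]))  # poor model: capped at 0.1
```

When the previous model predicted the new gradient well, the forcing term is small and the next subproblem is solved accurately; when the model was poor, the cap keeps the subproblem from being solved too loosely.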

Intuitively, we should solve the subproblem exactly if (i) $x_k$ is close to the
optimal solution, and (ii) $\hat f_k$ is a good model of $f$ near $x_k$. If (i), then we seek to
preserve the fast local convergence behavior of proximal Newton-type methods;
if (ii), then minimizing $\hat f_k$ is a good surrogate for minimizing $f$. In these cases,
(2.21) and (2.22) ensure the subproblem is solved accurately.

We can derive an expression like (2.9) for an inexact search direction in
terms of an explicit gradient, an implicit subgradient, and a residual term $r_k$.
This reveals connections to the inexact Newton search direction in the case of
smooth problems. (2.21) is equivalent to

$$0 \in \nabla \hat g_k(x_k + \Delta x_k) + \partial h(x_k + \Delta x_k) + r_k,$$

for some $r_k$ such that $\|r_k\|_{\nabla^2 g(x_k)^{-1}} \le \eta_k \|G_f(x_k)\|_{H^{-1}}$. Hence an inexact search
direction satisfies

$$H \Delta x_k \in -\nabla g(x_k) - \partial h(x_k + \Delta x_k) + r_k. \qquad (2.23)$$

3 Convergence results 

Our first result guarantees proximal Newton-type methods converge globally to
some optimal solution $x^*$. We assume $\{H_k\}$ are sufficiently positive definite;
i.e., $H_k \succeq mI$ for some $m > 0$. This assumption is required to guarantee the
methods are executable, i.e. there exist step lengths that satisfy the sufficient
descent condition (cf. Lemma 2.5).

Theorem 3.1. If $H_k \succeq mI$ for some $m > 0$, then $x_k$ converges to an optimal
solution starting at any $x_0 \in \operatorname{dom} f$.

Proof. $f(x_k)$ is decreasing because $\Delta x_k$ is always a descent direction (2.12) and
there exist step lengths satisfying the sufficient descent condition (2.16) (cf.
Lemma 2.5):

$$f(x_{k+1}) - f(x_k) \le \alpha t_k \Delta_k \le 0.$$

$f(x_k)$ must converge to some limit (we assumed $f$ is closed and the optimal
value is attained); hence $t_k \Delta_k$ must decay to zero. $t_k$ is bounded away from
zero because sufficiently small step lengths attain sufficient descent; hence $\Delta_k$
must decay to zero. We use (2.11) to deduce that $\Delta x_k$ also converges to zero:

$$\|\Delta x_k\|^2 \le \frac{1}{m} \Delta x_k^T H_k \Delta x_k \le -\frac{1}{m} \Delta_k.$$

$\Delta x_k$ is zero if and only if $x_k$ is an optimal solution (cf. Proposition 2.4), hence
$x_k$ converges to some $x^*$. □
3.1 Convergence of the proximal Newton method

The proximal Newton method uses the exact Hessian of the smooth part $g$
in the second-order model of $f$, i.e. $H_k = \nabla^2 g(x_k)$. This method converges
q-quadratically:

$$\|x_{k+1} - x^*\| = O(\|x_k - x^*\|^2),$$

subject to standard assumptions on the smooth part: we require $g$ to be locally
strongly convex and $\nabla^2 g$ to be locally Lipschitz continuous, i.e. $g$ is strongly
convex and $\nabla^2 g$ is Lipschitz continuous in a ball around $x^*$. These are standard
assumptions for proving that Newton's method for minimizing smooth functions
converges q-quadratically.

First, we prove an auxiliary result: step lengths of unity satisfy the sufficient
descent condition after sufficiently many iterations.

Lemma 3.2. Suppose (i) $g$ is locally strongly convex with constant $m$ and
(ii) $\nabla^2 g$ is locally Lipschitz continuous with constant $L_2$. If we choose $H_k =
\nabla^2 g(x_k)$, then the unit step length satisfies the sufficient decrease condition
(2.16) for $k$ sufficiently large.

Proof. Since $\nabla^2 g$ is locally Lipschitz continuous with constant $L_2$,

$$g(x + \Delta x) \le g(x) + \nabla g(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 g(x) \Delta x + \frac{L_2}{6} \|\Delta x\|^3.$$

We add $h(x + \Delta x)$ to both sides to obtain

$$f(x + \Delta x) \le g(x) + \nabla g(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 g(x) \Delta x + \frac{L_2}{6} \|\Delta x\|^3 + h(x + \Delta x).$$

We then add and subtract $h(x)$ from the right-hand side to obtain

$$\begin{aligned}
f(x + \Delta x) &\le g(x) + h(x) + \nabla g(x)^T \Delta x + h(x + \Delta x) - h(x)
+ \frac{1}{2} \Delta x^T \nabla^2 g(x) \Delta x + \frac{L_2}{6} \|\Delta x\|^3 \\
&\le f(x) + \Delta + \frac{1}{2} \Delta x^T \nabla^2 g(x) \Delta x + \frac{L_2}{6} \|\Delta x\|^3 \\
&\le f(x) + \Delta - \frac{1}{2} \Delta - \frac{L_2}{6m} \|\Delta x\| \Delta,
\end{aligned}$$

where we use (2.11) and (2.16). We rearrange to obtain

$$f(x + \Delta x) - f(x) \le \left( \frac{1}{2} - \frac{L_2}{6m} \|\Delta x\| \right) \Delta.$$

We can show $\Delta x_k$ decays to zero via the same argument that we used to prove
Theorem 3.1. Hence, because $\alpha < 1/2$ and $\Delta_k \le 0$, if $k$ is sufficiently large,
$f(x_k + \Delta x_k) - f(x_k) \le \alpha \Delta_k$. □
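Theorem 3.3. Suppose (i) $g$ is locally strongly convex with constant $m$, and
(ii) $\nabla^2 g$ is locally Lipschitz continuous with constant $L_2$. Then the proximal
Newton method converges q-quadratically to $x^*$.

Proof. The assumptions of Lemma 3.2 are satisfied; hence unit step lengths
satisfy the sufficient descent condition after sufficiently many steps:

$$x_{k+1} = x_k + \Delta x_k = \operatorname{prox}_h^{\nabla^2 g(x_k)} \left( x_k - \nabla^2 g(x_k)^{-1} \nabla g(x_k) \right).$$

$\operatorname{prox}_h^{\nabla^2 g(x_k)}$ is firmly nonexpansive in the $\nabla^2 g(x_k)$-norm, and $x^*$ is a fixed
point of the same mapping, hence

$$\begin{aligned}
\|x_{k+1} - x^*\|_{\nabla^2 g(x_k)}
&= \left\| \operatorname{prox}_h^{\nabla^2 g(x_k)} \left( x_k - \nabla^2 g(x_k)^{-1} \nabla g(x_k) \right)
- \operatorname{prox}_h^{\nabla^2 g(x_k)} \left( x^* - \nabla^2 g(x_k)^{-1} \nabla g(x^*) \right) \right\|_{\nabla^2 g(x_k)} \\
&\le \left\| x_k - x^* + \nabla^2 g(x_k)^{-1} (\nabla g(x^*) - \nabla g(x_k)) \right\|_{\nabla^2 g(x_k)} \\
&\le \frac{1}{\sqrt{m}} \left\| \nabla^2 g(x_k)(x_k - x^*) - \nabla g(x_k) + \nabla g(x^*) \right\|.
\end{aligned}$$

$\nabla^2 g$ is locally Lipschitz continuous with constant $L_2$; hence

$$\left\| \nabla^2 g(x_k)(x_k - x^*) - \nabla g(x_k) + \nabla g(x^*) \right\| \le \frac{L_2}{2} \|x_k - x^*\|^2.$$

We deduce that $x_k$ converges to $x^*$ quadratically:

$$\|x_{k+1} - x^*\| \le \frac{1}{\sqrt{m}} \|x_{k+1} - x^*\|_{\nabla^2 g(x_k)} \le \frac{L_2}{2m} \|x_k - x^*\|^2.$$

□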

3.2 Convergence of proximal quasi-Newton methods 

If the sequence $\{H_k\}$ satisfies the Dennis-Moré criterion [7], namely

$$\frac{\left\| \left( H_k - \nabla^2 g(x^*) \right) (x_{k+1} - x_k) \right\|}{\|x_{k+1} - x_k\|} \to 0, \qquad (3.1)$$

then we can prove that a proximal quasi-Newton method converges q-superlinearly:

$$\|x_{k+1} - x^*\| \le o(\|x_k - x^*\|).$$

We also require $g$ to be locally strongly convex and $\nabla^2 g$ to be locally Lipschitz
continuous. These are the same assumptions required to prove quasi-Newton
methods for minimizing smooth functions converge superlinearly.

First, we prove two auxiliary results: (i) step lengths of unity satisfy the suf- 
ficient descent condition after sufficiently many iterations, and (ii) the proximal 
quasi-Newton step is close to the proximal Newton step. 
Lemma 3.4. Suppose $g$ is twice continuously differentiable and $\nabla^2 g$ is locally
Lipschitz continuous with constant $L_2$. If $\{H_k\}$ satisfy the Dennis-Moré
criterion and their eigenvalues are bounded, then the unit step length satisfies the
sufficient descent condition (2.16) after sufficiently many iterations.

Proof. The proof is very similar to the proof of Lemma 3.2 and we defer the
details to the appendix. □

The proof of the next result mimics the analysis of Tseng and Yun [25].

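Proposition 3.5. Suppose $H$ and $\hat H$ are positive definite matrices with bounded
eigenvalues: $mI \preceq H \preceq MI$ and $\hat m I \preceq \hat H \preceq \hat M I$. Let $\Delta x$ and $\Delta \hat x$ be the search
directions generated using $H$ and $\hat H$ respectively:

$$\Delta x = \operatorname{prox}_h^H \left( x - H^{-1} \nabla g(x) \right) - x,$$
$$\Delta \hat x = \operatorname{prox}_h^{\hat H} \left( x - \hat H^{-1} \nabla g(x) \right) - x.$$

Then there exists $\bar\theta$ such that these two search directions satisfy

$$\|\Delta x - \Delta \hat x\| \le \sqrt{\frac{1 + \bar\theta}{m}} \left\| (\hat H - H) \Delta \hat x \right\|^{1/2} \|\Delta \hat x\|^{1/2}.$$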

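Proof. By (2.5) and Fermat's rule, $\Delta x$ and $\Delta \hat x$ are also the solutions to

$$\Delta x = \arg\min_d \; \nabla g(x)^T d + \Delta x^T H d + h(x + d),$$
$$\Delta \hat x = \arg\min_d \; \nabla g(x)^T d + \Delta \hat x^T \hat H d + h(x + d).$$

Hence $\Delta x$ and $\Delta \hat x$ satisfy

$$\nabla g(x)^T \Delta x + \Delta x^T H \Delta x + h(x + \Delta x)
\le \nabla g(x)^T \Delta \hat x + \Delta x^T H \Delta \hat x + h(x + \Delta \hat x)$$

and

$$\nabla g(x)^T \Delta \hat x + \Delta \hat x^T \hat H \Delta \hat x + h(x + \Delta \hat x)
\le \nabla g(x)^T \Delta x + \Delta \hat x^T \hat H \Delta x + h(x + \Delta x).$$

We sum these two inequalities and rearrange to obtain

$$\Delta x^T H \Delta x - \Delta x^T (H + \hat H) \Delta \hat x + \Delta \hat x^T \hat H \Delta \hat x \le 0.$$

We then complete the square on the left side and rearrange to obtain

$$\Delta x^T H \Delta x - 2 \Delta x^T H \Delta \hat x + \Delta \hat x^T H \Delta \hat x
\le \Delta x^T (\hat H - H) \Delta \hat x + \Delta \hat x^T (H - \hat H) \Delta \hat x.$$

The left side is $\|\Delta x - \Delta \hat x\|_H^2$ and the eigenvalues of $H$ are bounded. Thus

$$\begin{aligned}
\|\Delta x - \Delta \hat x\| &\le \frac{1}{\sqrt{m}} \left( \Delta x^T (\hat H - H) \Delta \hat x + \Delta \hat x^T (H - \hat H) \Delta \hat x \right)^{1/2} \\
&\le \frac{1}{\sqrt{m}} \left\| (\hat H - H) \Delta \hat x \right\|^{1/2} \left( \|\Delta x\| + \|\Delta \hat x\| \right)^{1/2}. \qquad (3.2)
\end{aligned}$$

We use a result due to Tseng and Yun (cf. Lemma 3 in [25]) to bound the term
$(\|\Delta x\| + \|\Delta \hat x\|)$. Let $P$ denote $\hat H^{-1/2} H \hat H^{-1/2}$. Then $\|\Delta x\|$ and $\|\Delta \hat x\|$ satisfy

$$\|\Delta x\| \le \left( \frac{\hat M \left( 1 + \lambda_{\max}(P) + \sqrt{1 - 2 \lambda_{\min}(P) + \lambda_{\max}(P)^2} \right)}{2m} \right) \|\Delta \hat x\|.$$

We denote the constant in parentheses by $\bar\theta$ and conclude that

$$\|\Delta x\| + \|\Delta \hat x\| \le (1 + \bar\theta) \|\Delta \hat x\|. \qquad (3.3)$$

We substitute (3.3) into (3.2) to obtain

$$\|\Delta x - \Delta \hat x\| \le \sqrt{\frac{1 + \bar\theta}{m}} \left\| (\hat H - H) \Delta \hat x \right\|^{1/2} \|\Delta \hat x\|^{1/2}.$$

□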



Theorem 3.6. Suppose (i) $g$ is twice continuously differentiable and locally
strongly convex, (ii) $\nabla^2 g$ is locally Lipschitz continuous with constant $L_2$. If
$\{H_k\}$ satisfy the Dennis-Moré criterion and their eigenvalues are bounded, then
a proximal quasi-Newton method converges q-superlinearly to $x^*$.

Proof. The assumptions of Lemma 3.4 are satisfied; hence unit step lengths
satisfy the sufficient descent condition after sufficiently many iterations:

$$x_{k+1} = x_k + \Delta x_k.$$

Since the proximal Newton method converges q-quadratically (cf. Theorem 3.3),

$$\begin{aligned}
\|x_{k+1} - x^*\| &\le \|x_k + \Delta x_k^{nt} - x^*\| + \|\Delta x_k - \Delta x_k^{nt}\| \\
&\le \frac{L_2}{m} \|x_k - x^*\|^2 + \|\Delta x_k - \Delta x_k^{nt}\|, \qquad (3.4)
\end{aligned}$$

where $\Delta x_k^{nt}$ denotes the proximal Newton search direction. We use Proposition
3.5 to bound the second term:

$$\|\Delta x_k - \Delta x_k^{nt}\| \le \sqrt{\frac{1 + \theta_k}{m}} \left\| (\nabla^2 g(x_k) - H_k) \Delta x_k \right\|^{1/2} \|\Delta x_k\|^{1/2}. \qquad (3.5)$$

$\nabla^2 g$ is Lipschitz continuous and $\Delta x_k$ satisfies the Dennis-Moré criterion; hence

$$\begin{aligned}
\left\| (\nabla^2 g(x_k) - H_k) \Delta x_k \right\|
&\le \left\| (\nabla^2 g(x_k) - \nabla^2 g(x^*)) \Delta x_k \right\| + \left\| (\nabla^2 g(x^*) - H_k) \Delta x_k \right\| \\
&\le L_2 \|x_k - x^*\| \|\Delta x_k\| + o(\|\Delta x_k\|).
\end{aligned}$$

$\|\Delta x_k\|$ is within some constant $\theta_k$ of $\|\Delta x_k^{nt}\|$ (cf. Lemma 3 in [25]), and we know
the proximal Newton method converges q-quadratically. Thus

$$\begin{aligned}
\|\Delta x_k\| &\le \theta_k \|\Delta x_k^{nt}\| = \theta_k \|x_{k+1}^{nt} - x_k\| \\
&\le \theta_k \left( \|x_{k+1}^{nt} - x^*\| + \|x_k - x^*\| \right) \\
&\le O(\|x_k - x^*\|^2) + \theta_k \|x_k - x^*\|,
\end{aligned}$$

where $x_{k+1}^{nt} = x_k + \Delta x_k^{nt}$. We substitute these expressions into (3.5) to obtain

$$\|\Delta x_k - \Delta x_k^{nt}\| = o(\|x_k - x^*\|).$$

We substitute this expression into (3.4) to obtain

$$\|x_{k+1} - x^*\| \le \frac{L_2}{m} \|x_k - x^*\|^2 + o(\|x_k - x^*\|),$$

and we deduce that $x_k$ converges to $x^*$ superlinearly. □

3.3 Convergence of the inexact proximal Newton method 

We make the same assumptions made by Dembo et al. in their analysis of inexact
Newton methods for minimizing smooth functions [5]: (i) $x_k$ is close to $x^*$ and
(ii) the unit step length is eventually accepted. We prove the inexact proximal
Newton method (i) converges q-linearly if the forcing terms $\eta_k$ are smaller than
some $\bar\eta$, and (ii) converges q-superlinearly if the forcing terms decay to zero.

First, we prove a consequence of the smoothness of $g$. Then, we use this
result to prove the inexact proximal Newton method converges locally subject
to standard assumptions on $g$ and $\eta_k$.

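Lemma 3.7. Suppose $g$ is locally strongly convex and $\nabla^2 g$ is locally Lipschitz
continuous. If $x_k$ is sufficiently close to $x^*$, then for any $x$,

$$\|x - x^*\|_{\nabla^2 g(x^*)} \le (1 + \epsilon) \|x - x^*\|_{\nabla^2 g(x_k)}.$$

Proof. We first expand $\nabla^2 g(x^*)^{1/2} (x - x^*)$ to obtain

$$\begin{aligned}
\nabla^2 g(x^*)^{1/2} (x - x^*)
&= \left( \nabla^2 g(x^*)^{1/2} - \nabla^2 g(x_k)^{1/2} \right) (x - x^*) + \nabla^2 g(x_k)^{1/2} (x - x^*) \\
&= \left( \nabla^2 g(x^*)^{1/2} - \nabla^2 g(x_k)^{1/2} \right) \nabla^2 g(x_k)^{-1/2} \nabla^2 g(x_k)^{1/2} (x - x^*) \\
&\quad + \nabla^2 g(x_k)^{1/2} (x - x^*) \\
&= \left( I + \left( \nabla^2 g(x^*)^{1/2} - \nabla^2 g(x_k)^{1/2} \right) \nabla^2 g(x_k)^{-1/2} \right) \nabla^2 g(x_k)^{1/2} (x - x^*).
\end{aligned}$$

We take norms to obtain

$$\|x - x^*\|_{\nabla^2 g(x^*)}
\le \left\| I + \left( \nabla^2 g(x^*)^{1/2} - \nabla^2 g(x_k)^{1/2} \right) \nabla^2 g(x_k)^{-1/2} \right\| \|x - x^*\|_{\nabla^2 g(x_k)}. \qquad (3.6)$$

If $g$ is locally strongly convex with constant $m$ and $x_k$ is sufficiently close to $x^*$,
then

$$\left\| \nabla^2 g(x^*)^{1/2} - \nabla^2 g(x_k)^{1/2} \right\| \le \sqrt{m} \, \epsilon.$$

We substitute this bound into (3.6) to deduce that

$$\|x - x^*\|_{\nabla^2 g(x^*)} \le (1 + \epsilon) \|x - x^*\|_{\nabla^2 g(x_k)}.$$

□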

Theorem 3.8. Suppose (i) $g$ is locally strongly convex with constant $m$, (ii)
$\nabla^2 g$ is locally Lipschitz continuous with constant $L_2$, and (iii) there exists $L_G$
such that the composite gradient step satisfies

$$\|G_f(x_k)\|_{\nabla^2 g(x_k)^{-1}} \le L_G \|x_k - x^*\|_{\nabla^2 g(x_k)}. \qquad (3.7)$$

1. If $\eta_k$ is smaller than some $\bar\eta < \frac{1}{L_G}$, then an inexact proximal Newton
method converges q-linearly to $x^*$.

2. If $\eta_k$ decays to zero, then an inexact proximal Newton method converges
q-superlinearly to $x^*$.

Proof. We use Lemma 3.7 to deduce

$$\|x_{k+1} - x^*\|_{\nabla^2 g(x^*)} \le (1 + \epsilon_1) \|x_{k+1} - x^*\|_{\nabla^2 g(x_k)}. \qquad (3.8)$$

We use the monotonicity of $\partial h$ to bound $\|x_{k+1} - x^*\|_{\nabla^2 g(x_k)}$. First, $\Delta x_k$ satisfies

$$\nabla^2 g(x_k)(x_{k+1} - x_k) \in -\nabla g(x_k) - \partial h(x_{k+1}) + r_k \qquad (3.9)$$

(cf. (2.23)). Also, the exact proximal Newton step at $x^*$ (trivially) satisfies

$$\nabla^2 g(x_k)(x^* - x^*) \in -\nabla g(x^*) - \partial h(x^*). \qquad (3.10)$$

Subtracting (3.10) from (3.9) and rearranging, we obtain

$$\partial h(x_{k+1}) - \partial h(x^*) \ni \nabla^2 g(x_k)(x_k - x_{k+1} - x^* + x^*) - \nabla g(x_k) + \nabla g(x^*) + r_k.$$

Since $\partial h$ is monotone,

$$\begin{aligned}
0 &\le (x_{k+1} - x^*)^T \left( \partial h(x_{k+1}) - \partial h(x^*) \right) \\
&= (x_{k+1} - x^*)^T \nabla^2 g(x_k)(x^* - x_{k+1}) + (x_{k+1} - x^*)^T \left( \nabla^2 g(x_k)(x_k - x^*) - \nabla g(x_k) + \nabla g(x^*) + r_k \right) \\
&= (x_{k+1} - x^*)^T \nabla^2 g(x_k) \left( x_k - x^* + \nabla^2 g(x_k)^{-1} (\nabla g(x^*) - \nabla g(x_k) + r_k) \right) - \|x_{k+1} - x^*\|_{\nabla^2 g(x_k)}^2.
\end{aligned}$$

We take norms to obtain

$$\|x_{k+1} - x^*\|_{\nabla^2 g(x_k)} \le \left\| x_k - x^* + \nabla^2 g(x_k)^{-1} (\nabla g(x^*) - \nabla g(x_k)) \right\|_{\nabla^2 g(x_k)} + \|r_k\|_{\nabla^2 g(x_k)^{-1}}.$$

If $x_k$ is sufficiently close to $x^*$, for any $\epsilon_2 > 0$ we have

$$\left\| x_k - x^* + \nabla^2 g(x_k)^{-1} (\nabla g(x^*) - \nabla g(x_k)) \right\|_{\nabla^2 g(x_k)} \le \epsilon_2 \|x_k - x^*\|_{\nabla^2 g(x_k)}$$

and hence

$$\|x_{k+1} - x^*\|_{\nabla^2 g(x_k)} \le \epsilon_2 \|x_k - x^*\|_{\nabla^2 g(x_k)} + \|r_k\|_{\nabla^2 g(x_k)^{-1}}. \qquad (3.11)$$

We substitute (3.11) into (3.8) to obtain

$$\|x_{k+1} - x^*\|_{\nabla^2 g(x^*)} \le (1 + \epsilon_1) \left( \epsilon_2 \|x_k - x^*\|_{\nabla^2 g(x_k)} + \|r_k\|_{\nabla^2 g(x_k)^{-1}} \right). \qquad (3.12)$$

Since $\Delta x_k$ satisfies the adaptive stopping condition (2.21),

$$\|r_k\|_{\nabla^2 g(x_k)^{-1}} \le \eta_k \|G_f(x_k)\|_{\nabla^2 g(x_k)^{-1}},$$

and since there exists $L_G$ such that $G_f$ satisfies (3.7),

$$\|r_k\|_{\nabla^2 g(x_k)^{-1}} \le \eta_k L_G \|x_k - x^*\|_{\nabla^2 g(x_k)}. \qquad (3.13)$$

We substitute (3.13) into (3.12) to obtain

$$\|x_{k+1} - x^*\|_{\nabla^2 g(x^*)} \le (1 + \epsilon_1)(\epsilon_2 + \eta_k L_G) \|x_k - x^*\|_{\nabla^2 g(x_k)}.$$

If $\eta_k$ is smaller than some

$$\bar\eta < \frac{1}{L_G} \left( \frac{1}{1 + \epsilon_1} - \epsilon_2 \right),$$

then $x_k$ converges q-linearly to $x^*$. If $\eta_k$ decays to zero (the smoothness of $g$
lets $\epsilon_1, \epsilon_2$ decay to zero), then $x_k$ converges q-superlinearly to $x^*$. □

If we assume $g$ is twice continuously differentiable, we can derive an
expression for $L_G$. Combining this result with Theorem 3.8 we deduce the convergence
of an inexact Newton method with our adaptive stopping condition (2.21).

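Lemma 3.9. Suppose $g$ is locally strongly convex with constant $m$, and $\nabla^2 g$ is
locally Lipschitz continuous. If $x_k$ is sufficiently close to $x^*$, there exists $\kappa$ such
that

$$\|G_f(x_k)\|_{\nabla^2 g(x_k)^{-1}} \le \left( \sqrt{\kappa} (1 + \epsilon) + \frac{1}{m} \right) \|x_k - x^*\|_{\nabla^2 g(x_k)}.$$

Proof. Since $G_f(x^*)$ is zero,

$$\begin{aligned}
\|G_f(x_k)\|_{\nabla^2 g(x_k)^{-1}}
&\le \frac{1}{\sqrt{m}} \|G_f(x_k) - G_f(x^*)\| \\
&\le \frac{1}{\sqrt{m}} \|\nabla g(x_k) - \nabla g(x^*)\| + \frac{1}{\sqrt{m}} \|x_k - x^*\| \\
&\le \sqrt{\kappa} \, \|\nabla g(x_k) - \nabla g(x^*)\|_{\nabla^2 g(x_k)^{-1}} + \frac{1}{m} \|x_k - x^*\|_{\nabla^2 g(x_k)}, \qquad (3.14)
\end{aligned}$$

where $\kappa = L_1 / m$. The second inequality follows from Lemma 2.1. We split
$\|\nabla g(x_k) - \nabla g(x^*)\|_{\nabla^2 g(x_k)^{-1}}$ into two terms:

$$\|\nabla g(x_k) - \nabla g(x^*)\|_{\nabla^2 g(x_k)^{-1}}
\le \left\| \nabla g(x_k) - \nabla g(x^*) + \nabla^2 g(x_k)(x^* - x_k) \right\|_{\nabla^2 g(x_k)^{-1}} + \|x_k - x^*\|_{\nabla^2 g(x_k)}.$$

If $x_k$ is sufficiently close to $x^*$, for any $\epsilon_1 > 0$ we have

$$\left\| \nabla g(x_k) - \nabla g(x^*) + \nabla^2 g(x_k)(x^* - x_k) \right\|_{\nabla^2 g(x_k)^{-1}} \le \epsilon_1 \|x_k - x^*\|_{\nabla^2 g(x_k)}.$$

Hence

$$\|\nabla g(x_k) - \nabla g(x^*)\|_{\nabla^2 g(x_k)^{-1}} \le (1 + \epsilon_1) \|x_k - x^*\|_{\nabla^2 g(x_k)}.$$

We substitute this bound into (3.14) to obtain

$$\|G_f(x_k)\|_{\nabla^2 g(x_k)^{-1}} \le \left( \sqrt{\kappa} (1 + \epsilon_1) + \frac{1}{m} \right) \|x_k - x^*\|_{\nabla^2 g(x_k)}.$$

We use Lemma 3.7 to deduce

$$\|G_f(x_k)\|_{\nabla^2 g(x_k)^{-1}} \le \left( \sqrt{\kappa} (1 + \epsilon) + \frac{1}{m} \right) \|x_k - x^*\|_{\nabla^2 g(x_k)}.$$

□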



Corollary 3.10. Suppose (i) $g$ is locally strongly convex with constant $m$, and
(ii) $\nabla^2 g$ is locally Lipschitz continuous with constant $L_2$.

1. If $\eta_k$ is smaller than some $\bar\eta < \frac{1}{\sqrt{\kappa} + 1/m}$, an inexact proximal Newton
method converges q-linearly to $x^*$.

2. If $\eta_k$ decays to zero, an inexact proximal Newton method converges
q-superlinearly to $x^*$.

Remark 1. In many cases, we can obtain tighter bounds on $\|G_f(x_k)\|_{\nabla^2 g(x_k)^{-1}}$.
E.g., when minimizing smooth functions ($h$ is zero), we can show

$$\|G_f(x_k)\|_{\nabla^2 g(x_k)^{-1}} = \|\nabla g(x_k)\|_{\nabla^2 g(x_k)^{-1}} \le (1 + \epsilon) \|x_k - x^*\|_{\nabla^2 g(x_k)}.$$

This yields the classic result of Dembo et al.: if $\eta_k$ is uniformly smaller than
one, then the inexact Newton method converges q-linearly.

Finally, we justify our choice of forcing terms: if we choose $\eta_k$ according to
(2.22), then the inexact proximal Newton method converges q-superlinearly.

Theorem 3.11. Suppose (i) $x_0$ is sufficiently close to $x^*$, and (ii) the
assumptions of Theorem 3.8 are satisfied. If we choose $\eta_k$ according to (2.22), then the
inexact proximal Newton method converges q-superlinearly.



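Proof. Since the assumptions of Theorem 3.8 are satisfied, $x_k$ converges locally
to $x^*$. Also, since $\nabla^2 g$ is Lipschitz continuous,

$$\begin{aligned}
\|\nabla g(x_k) - \nabla g(x_{k-1}) - \nabla^2 g(x_{k-1}) \Delta x_{k-1}\|
&\le \left(\int_0^1 \|\nabla^2 g(x_{k-1} + s \Delta x_{k-1}) - \nabla^2 g(x^*)\| \, ds\right) \|\Delta x_{k-1}\| \\
&\quad + \|\nabla^2 g(x^*) - \nabla^2 g(x_{k-1})\| \, \|\Delta x_{k-1}\| \\
&\le \left(\int_0^1 L_2 \|x_{k-1} + s \Delta x_{k-1} - x^*\| \, ds\right) \|\Delta x_{k-1}\| \\
&\quad + L_2 \|x_{k-1} - x^*\| \, \|\Delta x_{k-1}\|.
\end{aligned}$$

We integrate the first term (after applying the triangle inequality to the integrand) to obtain

$$\int_0^1 L_2 \|x_{k-1} + s \Delta x_{k-1} - x^*\| \, ds \le L_2 \|x_{k-1} - x^*\| + \frac{L_2}{2} \|\Delta x_{k-1}\|.$$

We substitute these expressions into (2.22) to obtain

$$\eta_k \le L_2 \left(2 \|x_{k-1} - x^*\| + \frac{1}{2} \|\Delta x_{k-1}\|\right) \frac{\|\Delta x_{k-1}\|}{\|\nabla g(x_{k-1})\|}. \tag{3.15}$$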

If $\nabla g(x^*) \ne 0$, $\|\nabla g(x)\|$ is bounded away from zero in a neighborhood of $x^*$.
Hence $\eta_k$ decays to zero and $x_k$ converges q-superlinearly to $x^*$. Otherwise,

$$\|\nabla g(x_{k-1})\| = \|\nabla g(x_{k-1}) - \nabla g(x^*)\| \ge m \|x_{k-1} - x^*\|. \tag{3.16}$$

We substitute (3.15) and (3.16) into (2.22) to obtain

$$\eta_k \le \frac{L_2}{m} \left(2 \|x_{k-1} - x^*\| + \frac{\|\Delta x_{k-1}\|}{2}\right) \frac{\|\Delta x_{k-1}\|}{\|x_{k-1} - x^*\|}. \tag{3.17}$$

The triangle inequality yields

$$\|\Delta x_{k-1}\| \le \|x_k - x^*\| + \|x_{k-1} - x^*\|.$$

We divide by $\|x_{k-1} - x^*\|$ to obtain

$$\frac{\|\Delta x_{k-1}\|}{\|x_{k-1} - x^*\|} \le 1 + \frac{\|x_k - x^*\|}{\|x_{k-1} - x^*\|}.$$

If $k$ is sufficiently large, $x_k$ converges q-linearly to $x^*$ and hence

$$\frac{\|\Delta x_{k-1}\|}{\|x_{k-1} - x^*\|} \le 2.$$

We substitute this expression into (3.17) to obtain

$$\eta_k \le \frac{L_2}{m} \left(4 \|x_{k-1} - x^*\| + \|\Delta x_{k-1}\|\right).$$

Hence $\eta_k$ decays to zero, and $x_k$ converges q-superlinearly to $x^*$. □
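
As a rough illustration of the quantity analyzed in this proof, the following sketch computes a forcing term as the ratio of the linearization error of the gradient to the previous gradient norm. It assumes that (2.22) has the form of this ratio capped by a constant; the function name `forcing_term`, the default cap of 0.1, and the toy function are hypothetical choices made for the example only.

```python
import numpy as np

def forcing_term(grad_new, grad_old, hess_old, dx_old, cap=0.1):
    """Ratio bounded in the proof above, capped at `cap`.

    The cap and its default value are assumptions of this sketch.
    """
    # Error of the first-order model of the gradient along the last step.
    model_error = np.linalg.norm(grad_new - grad_old - hess_old @ dx_old)
    return min(cap, model_error / np.linalg.norm(grad_old))

# Toy smooth function g(x) = sum(x**4) / 4 (hypothetical example).
def grad(x):
    return x ** 3

def hess(x):
    return np.diag(3.0 * x ** 2)

x_old = np.array([1.0, 2.0])
dx = np.array([-0.1, -0.2])
x_new = x_old + dx  # unit step, as in the proof

eta = forcing_term(grad(x_new), grad(x_old), hess(x_old), dx)
print(eta)
```

For a quadratic $g$ the linearization error vanishes, so the ratio is zero; for general $g$ it shrinks as the steps shrink, which is the behavior the proof relies on.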

4 Computational experiments 

First we explore how inexact search directions affect the convergence behavior
of proximal Newton-type methods on a problem in bioinformatics. We show
that choosing the forcing terms according to (2.22) avoids "oversolving" the
subproblem. Then we demonstrate the performance of proximal Newton-type
methods using a problem in statistical learning. We show that the methods are
suited to problems with expensive smooth function evaluations.

4.1 Inverse covariance estimation 

Suppose we are given i.i.d. samples $x^{(1)}, \dots, x^{(n)}$ from a Gaussian Markov
random field (MRF) with unknown inverse covariance matrix $\bar\Theta$:

$$\Pr(x; \bar\Theta) \propto \exp\left(x^T \bar\Theta x / 2 - \log\det(\bar\Theta)\right).$$

We seek a sparse maximum likelihood estimate of the inverse covariance matrix:

$$\hat\Theta := \operatorname*{arg\,min}_{\Theta} \; \operatorname{tr}(\hat\Sigma \Theta) - \log\det(\Theta) + \lambda \|\operatorname{vec}(\Theta)\|_1, \tag{4.1}$$

where $\hat\Sigma$ denotes the sample covariance matrix. We regularize using an
entry-wise $\ell_1$ norm to avoid overfitting the data and promote sparse estimates. $\lambda$ is a
parameter that balances goodness-of-fit and sparsity.
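
The objective in (4.1) splits into a smooth part, tr(ΣΘ) − log det(Θ), and a nonsmooth entry-wise ℓ1 penalty whose proximal mapping is soft-thresholding. A minimal sketch of both pieces follows; the function names are hypothetical, and returning infinity outside the positive definite cone is a convention chosen for this example, not something stated in the text.

```python
import numpy as np

def inv_cov_objective(theta, sigma_hat, lam):
    """Evaluate tr(sigma_hat @ theta) - logdet(theta) + lam * ||vec(theta)||_1."""
    try:
        # Cholesky succeeds only for (numerically) positive definite matrices.
        chol = np.linalg.cholesky(theta)
    except np.linalg.LinAlgError:
        return np.inf  # convention of this sketch: infeasible points cost +inf
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return np.trace(sigma_hat @ theta) - logdet + lam * np.abs(theta).sum()

def soft_threshold(z, tau):
    """Proximal mapping of tau * ||.||_1, applied entry-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

value = inv_cov_objective(np.eye(2), np.eye(2), 0.5)
print(value)
```

With Θ and the sample covariance both equal to the 2-by-2 identity and λ = 0.5, the trace term is 2, the log-determinant is 0, and the penalty is 1.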

We use two datasets: (i) Estrogen, a gene expression dataset consisting of
682 probe sets collected from 158 patients, and (ii) Leukemia, another gene
expression dataset consisting of 1255 genes from 72 patients.¹ The features of
Estrogen were converted to log-scale and normalized to have zero mean and unit
variance. $\lambda$ was chosen to match the values used in [19].

¹ These datasets are available from http://www.math.nus.edu.sg/~mattohkc/ with the
SPINCOVSE package.

Figure 1: Inverse covariance estimation problem (Estrogen dataset). Convergence
behavior of proximal BFGS method with three subproblem stopping conditions.

Figure 2: Inverse covariance estimation problem (Leukemia dataset). Convergence
behavior of proximal BFGS method with three subproblem stopping conditions.



We solve the inverse covariance estimation problem (4.1) using a proximal
BFGS method, i.e., $H_k$ is updated according to the BFGS updating formula. To
explore how inexact search directions affect the convergence behavior, we use
three rules to decide how accurately to solve subproblem (2.5):

1. adaptive: stop when the adaptive stopping condition (2.21) is satisfied;

2. exact: solve subproblem exactly;

3. inexact: stop after 10 iterations.
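
The three rules can be sketched as a single predicate that an inner (subproblem) solver checks after each of its iterations. This is only an illustration: the function name, the argument names, and the tolerance used to stand in for "exactly" are hypothetical, and the adaptive test assumes the residual and the composite gradient step are measured in the same norm, in the spirit of the inequality used in the proof of Theorem 3.8.

```python
def inner_solver_should_stop(rule, inner_iter, residual_norm, grad_step_norm,
                             eta, exact_tol=1e-10, inexact_iters=10):
    """Decide whether to stop solving the subproblem.

    rule           -- "adaptive", "exact", or "inexact"
    inner_iter     -- number of inner iterations completed so far
    residual_norm  -- norm of the subproblem residual r_k
    grad_step_norm -- norm of the composite gradient step at x_k
    eta            -- current forcing term
    exact_tol      -- tolerance standing in for an exact solve (assumption)
    """
    if rule == "adaptive":
        return residual_norm <= eta * grad_step_norm
    if rule == "exact":
        return residual_norm <= exact_tol
    if rule == "inexact":
        return inner_iter >= inexact_iters
    raise ValueError(f"unknown rule: {rule!r}")

print(inner_solver_should_stop("adaptive", 3, 1e-3, 1.0, 0.1))
```

The adaptive rule ties the inner accuracy to the current forcing term, so early outer iterations are solved loosely and later ones more tightly.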

We plot relative suboptimality versus function evaluations and time on the
Estrogen dataset in Figure 1 and the Leukemia dataset in Figure 2.

On both datasets, the exact stopping condition yields the fastest convergence
(ignoring computational expense per step), followed closely by the adaptive
stopping condition (see Figures 1 and 2). If we account for time per step,
then the adaptive stopping condition yields the fastest convergence. Note that
the adaptive stopping condition yields superlinear convergence (like the exact
proximal BFGS method). The third (inexact) stopping condition yields only
linear convergence (like a first-order method), and its convergence rate is
affected by the condition number of $\hat\Theta$. On the Leukemia dataset, the condition
number is worse and the convergence is slower.

[Figure 3 panels: function evaluations; time (sec).]

Figure 3: Logistic regression problem (gisette dataset). Proximal L-BFGS
method (L = 50) versus FISTA and SpaRSA.

4.2 Logistic regression 

Suppose we are given samples $x^{(1)}, \dots, x^{(n)}$ with labels $y^{(1)}, \dots, y^{(n)} \in \{-1, 1\}$.
We fit a logit model to our data:

$$\operatorname*{minimize}_{w \in \mathbf{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i w^T x_i)\right) + \lambda \|w\|_1. \tag{4.2}$$

Again, the regularization term $\|w\|_1$ promotes sparse solutions and $\lambda$ balances
goodness-of-fit and sparsity.
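
To make the cost structure of (4.2) concrete, here is a small sketch of the objective and of the gradient of its smooth part. It assumes labels coded as ±1 and a design matrix `X` whose rows are the samples; the function names and the tiny data set are hypothetical. `np.logaddexp` is used to evaluate log(1 + exp(·)) without overflow.

```python
import numpy as np

def logistic_l1_objective(w, X, y, lam):
    """Average logistic loss plus lam * ||w||_1, labels y in {-1, +1}."""
    margins = y * (X @ w)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m).
    loss = np.logaddexp(0.0, -margins).mean()
    return loss + lam * np.abs(w).sum()

def logistic_gradient(w, X, y):
    """Gradient of the smooth part (the average logistic loss)."""
    margins = y * (X @ w)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), evaluated stably.
    coeff = -y * np.exp(-np.logaddexp(0.0, margins))
    return X.T @ coeff / X.shape[0]

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]])
y = np.array([1.0, -1.0, 1.0])
value_at_zero = logistic_l1_objective(np.zeros(2), X, y, 0.1)
print(value_at_zero)
```

At w = 0 every margin is zero, so the loss equals log 2 regardless of the data. Each evaluation touches every nonzero entry of `X` and applies exp/log per sample, which is why the smooth part is expensive on large design matrices.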

We use two datasets: (i) gisette, a handwritten digits dataset from the
NIPS 2003 feature selection challenge (n = 5000), and (ii) rcv1, an archive of
categorized news stories from Reuters (n = 47,000).² The features of gisette
have been scaled to be within the interval [−1, 1], and those of rcv1 have been scaled
to be unit vectors. $\lambda$ was chosen to match the value reported in [28], where it
was chosen by five-fold cross validation on the training set.

We compare a proximal L-BFGS method with SpaRSA and the TFOCS
implementation of FISTA (also Nesterov's 1983 method) on problem (4.2). We
plot relative suboptimality versus function evaluations and time on the gisette
dataset in Figure 3 and on the rcv1 dataset in Figure 4.

The smooth part requires many expensive exp/log operations to evaluate.
On the dense gisette dataset (30 million nonzero entries in a 6000 by 5000
design matrix), evaluating g dominates the computational cost. The proximal
L-BFGS method clearly outperforms the other methods because the computational
expense is shifted to solving the subproblems, whose objective functions
are cheap to evaluate (see Figure 3). On the sparse rcv1 dataset (40 million
nonzero entries in a 542,000 by 47,000 design matrix), the cost of evaluating g
makes up a smaller portion of the total cost, and the proximal L-BFGS method
barely outperforms SpaRSA (see Figure 4).

Figure 4: Logistic regression problem (rcv1 dataset). Proximal L-BFGS method
(L = 50) versus FISTA and SpaRSA.

² These datasets are available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

5 Conclusion 

Given the popularity of first-order methods for minimizing composite functions, 
there has been a flurry of activity around the development of Newton-type 
methods for minimizing composite functions |12l ® I16) . We analyze proximal 
Newton-type methods for such functions and show that they have several strong 
advantages over first-order methods: 

1. They converge rapidly near the optimal solution, and can produce a
solution of high accuracy.

2. They are insensitive to the choice of coordinate system and to the condi- 
tion number of the level sets of the objective. 

3. They scale well with problem size. 

The main disadvantage is the cost of solving the subproblem. We have shown 
that it is possible to reduce the cost and retain the fast convergence rate by 
solving the subproblems inexactly. We hope our results kindle further interest 
in proximal Newton-type methods as an alternative to first-order and interior 
point methods for minimizing composite functions.

A Proofs

Lemma 3.4. Suppose $g$ is twice continuously differentiable and $\nabla^2 g$ is locally
Lipschitz continuous with constant $L_2$. If $\{H_k\}$ satisfy the Dennis-Moré
criterion (3.1) and their eigenvalues are bounded, then the unit step length satisfies
the sufficient descent condition (2.16) after sufficiently many iterations.

Proof. Since $\nabla^2 g$ is locally Lipschitz continuous with constant $L_2$,

$$g(x + \Delta x) \le g(x) + \nabla g(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 g(x) \Delta x + \frac{L_2}{6} \|\Delta x\|^3.$$

We add $h(x + \Delta x)$ to both sides to obtain

$$f(x + \Delta x) \le g(x) + \nabla g(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 g(x) \Delta x + \frac{L_2}{6} \|\Delta x\|^3 + h(x + \Delta x).$$

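We then add and subtract $h(x)$ from the right-hand side to obtain

$$\begin{aligned}
f(x + \Delta x) &\le g(x) + h(x) + \nabla g(x)^T \Delta x + h(x + \Delta x) - h(x) \\
&\quad + \frac{1}{2} \Delta x^T \nabla^2 g(x) \Delta x + \frac{L_2}{6} \|\Delta x\|^3 \\
&\le f(x) + \Delta + \frac{1}{2} \Delta x^T \nabla^2 g(x) \Delta x + \frac{L_2}{6} \|\Delta x\|^3 \\
&\le f(x) + \Delta + \frac{1}{2} \Delta x^T \nabla^2 g(x) \Delta x - \frac{L_2}{6m} \|\Delta x\| \, \Delta,
\end{aligned}$$

where we use (2.11). We add and subtract $\frac{1}{2} \Delta x^T H \Delta x$ to yield

$$\begin{aligned}
f(x + \Delta x) - f(x) &\le \Delta + \frac{1}{2} \Delta x^T \left(\nabla^2 g(x) - H\right) \Delta x + \frac{1}{2} \Delta x^T H \Delta x - \frac{L_2}{6m} \|\Delta x\| \, \Delta \\
&\le \Delta + \frac{1}{2} \Delta x^T \left(\nabla^2 g(x) - H\right) \Delta x - \frac{1}{2} \Delta - \frac{L_2}{6m} \|\Delta x\| \, \Delta,
\end{aligned} \tag{A.1}$$

where we again use (2.11). $\nabla^2 g$ is locally Lipschitz continuous and $\Delta x$ satisfies
the Dennis-Moré criterion. Thus,

$$f(x + \Delta x) - f(x) \le \left(\frac{1}{2} - \frac{L_2}{6m} \|\Delta x\|\right) \Delta + o\left(\|\Delta x\|^2\right).$$
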
We can show $\Delta x_k$ converges to zero via the argument used in the proof of
Theorem 3.1. Hence, for $k$ sufficiently large, $f(x_k + \Delta x_k) - f(x_k) \le \alpha \Delta_k$. □
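
The lemma says that a backtracking line search eventually accepts the unit step. The sketch below checks this on a toy composite problem. It assumes the sufficient descent condition (2.16) has the form f(x + tΔx) ≤ f(x) + αtΔ with Δ = ∇g(x)ᵀΔx + h(x + Δx) − h(x); the function name, the parameters `alpha` and `beta`, and the toy data are hypothetical.

```python
import numpy as np

def sufficient_descent_step(f, grad_g, h, x, dx,
                            alpha=0.25, beta=0.5, max_backtracks=50):
    """Backtrack from t = 1 until the assumed sufficient descent test holds."""
    delta = grad_g(x) @ dx + h(x + dx) - h(x)
    fx = f(x)
    t = 1.0
    for _ in range(max_backtracks):
        if f(x + t * dx) <= fx + alpha * t * delta:
            return t, delta
        t *= beta
    return t, delta  # no step accepted within the backtracking budget

# Toy problem: g(x) = 0.5 * ||x - b||^2, h(x) = lam * ||x||_1.
b = np.array([1.0, 0.2])
lam = 0.5
g = lambda x: 0.5 * np.sum((x - b) ** 2)
grad_g = lambda x: x - b
h = lambda x: lam * np.sum(np.abs(x))
f = lambda x: g(x) + h(x)

x = np.array([3.0, -2.0])
# With H equal to the exact Hessian (the identity here), the proximal Newton
# step lands on the soft-thresholded point.
x_plus = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
dx = x_plus - x

t, delta = sufficient_descent_step(f, grad_g, h, x, dx)
print(t, delta)
```

Here the smooth part is quadratic and H matches its Hessian exactly, so the unit step is accepted immediately, which is the situation the lemma describes in the limit.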